July 26, 2016
Neural random forests
Deep neural decision forests
A decision tree is a logical model represented as a tree that shows how the value of a target variable can be predicted by using the values of a set of predictor (input) variables.
A decision tree recursively partitions the data and sample space to construct the predictive model.
Its name derives from the practice of displaying the partitions as a decision tree, from which the roles of the predictor variables may be inferred.
A classification or regression tree is a prediction model that can be represented as a decision tree.
A tree-structured classifier is a decision tree for predicting a class variable from one or more predictor variables.
A regression tree model is a nonparametric estimate of a regression function constructed by recursively partitioning a data set with the values of its predictor variables.
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields.
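As a minimal sketch of these ideas (using Python's scikit-learn rather than the R packages discussed later, with the standard iris data purely as an example), a shallow classification tree recursively partitions the predictor space:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a small benchmark data set and grow a shallow classification tree.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node tests one predictor against a threshold;
# each leaf predicts the majority class of its segment.
print(tree.get_depth())      # depth of the fitted tree
print(tree.predict(X[:1]))   # predicted class for the first observation
```

Even at depth 2 the tree already recovers most of the class structure, which is why such trees are prized for interpretability.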
Morgan and Sonquist (1963, JASA)
An empirical tree represents a segmentation of the data that is created by applying a series of split rules. Each split rule assigns an observation to a segment based on the values of inputs.
One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.
The original segment contains the entire data set and is called the root node of the tree. A node with all its successors forms a branch of the node that created it.
The nodes that have child nodes are called interior (or internal, intermediate) nodes. The final nodes that do not have child nodes are called leaves (or terminal nodes). For each leaf, a decision is made and applied to all observations in the leaf.
The type of decision depends on the context. In predictive modeling, the decision is simply the predicted value.
Split rule: determined either by an exhaustive search over all possible splits or by statistical tests
A good split rule makes the data homogeneous within each subnode and heterogeneous between subnodes.
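Homogeneity within subnodes is typically scored by the decrease in an impurity measure such as the Gini index. A minimal, self-contained sketch (function names are illustrative, not from any package):

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(x, y, threshold):
    """Decrease in Gini impurity achieved by the split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(y) - weighted

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
# A split at 2.5 separates the classes perfectly: impurity drops from 0.5 to 0.
print(impurity_decrease(x, y, 2.5))  # 0.5
```

An exhaustive-search splitter simply evaluates this quantity at every candidate threshold of every predictor and keeps the best.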
Loh, W.-Y. (2008). Regression by parts: Fitting visually interpretable models with GUIDE. In Handbook of Data Visualization, C.Chen, W.Härdle, and A.Unwin, Eds. Springer, pp.447-469.
Loh, W.-Y. (2008). Classification and regression tree methods. In Encyclopedia of Statistics in Quality and Reliability, F.Ruggeri, R.Kenett, and F.W. Faltin, Eds. Wiley, Chichester, UK, pp.315-323.
Loh, W.-Y. (2010). Tree-structured classifiers. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 364-369.
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14-23.
Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review, 82, 329-348.
Merkle, E.C. and Shaffer, V.A. (2011). Binary recursive partitioning: Background, methods, and application to psychology, British Journal of Mathematical and Statistical Psychology, 64, 161–181.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 415–434.
Strobl, C., Malley, J. and Tutz, G. (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random forests. Psychological Methods, 14(4), 323–348.
Piecewise constant least squares models: AID (Morgan and Sonquist, 1964), CART, RPART, GUIDE (Loh, 2002)
Piecewise linear least squares, polynomial regression, quantile regression, and longitudinal and multi-response data: Loh (2014) with discussion
Subgroup identification of differential treatment effects: Loh et al. (2014), Hothorn et al. (2014)
Others: M5 (Quinlan, 1992), MOB (Zeileis et al., 2008), MELT (Eo and Cho, 2014), KAPS (Eo et al., 2015)
Missing values, selection bias, accuracy, speed, and tree complexity
Diagnostic tool: Simonoff (2013)
Basic notations
Enable tree construction when there are missing values in the learning sample
Enable classification of new cases with missing values
Rank variables by their order of importance (not available in rpart)
Detect masking of variables
Biased toward variables with more splits: A \(k\)-valued ordered variable has \(k - 1\) splits; a \(k\)-valued categorical variable has \(2^{k-1} - 1\) splits.
Biased toward predictors with more missing values: Split method uses only proportions of nonmissing cases—it ignores the number of missing values. A variable taking a unique value for exactly one case in each class and missing on all other cases yields the largest decrease in impurity. Bias exists for surrogate splits too.
Computation: Impractical when there are three or more classes and categorical variables with many values. Note: Because CART and RPART encode each categorical variable split with a 32-bit binary integer, they do not properly deal with categorical variables having more than 32 values.
Prediction accuracy: Often no better than linear discriminant analysis.
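The split counts behind the selection-bias and computation points above can be checked directly (a small sketch with illustrative function names):

```python
def ordered_splits(k):
    """Number of binary splits of a k-valued ordered variable: k - 1."""
    return k - 1

def categorical_splits(k):
    """Number of binary splits of a k-valued categorical variable: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1

# The gap grows exponentially, which is the source of both the selection bias
# and the 32-level limit in CART/RPART noted above.
for k in (2, 4, 8, 16, 32):
    print(k, ordered_splits(k), categorical_splits(k))
```

At \(k = 32\) the categorical variable already offers \(2^{31} - 1\) candidate splits, which is why enumerating them becomes impractical.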
For more details, see the talk slides.
See the JCGS paper
At each node, a case goes to the left child node if the stated condition is satisfied. Sample sizes are given beside terminal nodes.
95% bootstrap intervals for relative risk of treatment vs. placebo below nodes.
Identifying groups of patients for whom the treatment effect differs from the average treatment effect. The effect may be:
– stronger,
– weaker, or
– in the opposite direction.
Suitable models promise better prediction of treatment effect and thus individualised treatments.
Regression trees are natural for subgroup identification
Examine residual patterns for each treatment level
Test for treatment interaction
Perturb the population instead of the data
See the partykit package for building tree-structured models.
tree::tree, rpart::rpart, mvpart::mvpart, party::mob, psychotree::psychotree, betareg::betatree, RWeka::M5P, evtree::evtree, REEMtree::REEMtree, vcrpart::vcrpart, kaps::kaps, melt::melt
History of Random Forest (RF)
Figure 1 of Boulesteix et al. (2012)
There are three parameters for RF algorithms:
By default, in the regression mode of the randomForest package, the parameter mtry = ceiling(p / 3), the resampling size a_n = n, and nodesize = 5.
Algorithm 1 of Biau and Scornet (2016)
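These parameters can be sketched with scikit-learn's RandomForestRegressor, whose max_features, min_samples_leaf, and bootstrap arguments are rough analogues of mtry, nodesize, and the resampling size \(a_n\) (the mapping is approximate, not an exact reproduction of the randomForest defaults):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Simulated regression data purely for illustration.
X, y = make_regression(n_samples=200, n_features=9, n_informative=5,
                       random_state=0)
p = X.shape[1]

rf = RandomForestRegressor(
    n_estimators=100,
    max_features=max(1, p // 3),  # ~ mtry = p/3 in regression mode
    min_samples_leaf=5,           # ~ nodesize = 5
    bootstrap=True,               # each tree resamples a_n = n cases
    random_state=0,
).fit(X, y)

print(rf.score(X, y))  # in-sample R^2
```

Tuning these three quantities (number of candidate variables per split, terminal node size, and resampling size) is usually all that is needed in practice.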
RF can be used to rank the importance of variables via two measures of significance:
For more details, see Goldstein et al. (2014).
Set \({\bf X} = (X^{(1)}, \ldots, X^{(p)}).\)
For a forest resulting from the aggregation of \(M\) trees, the MDI of the variable \(X^{(j)}\) is defined by
\[
\widehat{\mathrm{MDI}}(X^{(j)}) = \frac{1}{M} \sum_{l = 1}^{M} \sum_{t \in \mathcal{T}_l} p_{n,t} \, L_{reg, n} (j_{n,t}, z_{n,t}),
\] where \(p_{n,t}\) is the fraction of observations falling into node \(t\), \((j_{n,t}, z_{n,t})\) is the splitting variable and threshold selected in node \(t\), and \(L_{reg,n}\) is the resulting decrease in the empirical squared-error criterion.
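In scikit-learn, the impurity-based feature_importances_ attribute corresponds to MDI, while permutation importance plays the role of mean decrease in accuracy (MDA). A hedged sketch on simulated data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Simulated data: only 2 of the 6 predictors are informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity (MDI), averaged over the trees of the forest.
mdi = rf.feature_importances_

# MDA-style importance, estimated by permuting one variable at a time.
mda = permutation_importance(rf, X, y, n_repeats=10,
                             random_state=0).importances_mean

print(np.argsort(mdi)[::-1])  # variables ranked by MDI
print(np.argsort(mda)[::-1])  # variables ranked by MDA
```

Comparing the two rankings is a common sanity check, since MDI inherits the split-selection biases discussed earlier while permutation importance does not.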
For more details, see the useR! 2008 talk slides.
For more details, see the JSCS paper.
Online learning is a method in which data become available in sequential order and are used to update the predictor at each step, in contrast to batch learning, which fits the predictor on the entire training set at once.
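A generic illustration of online updating (using scikit-learn's SGDClassifier.partial_fit as a stand-in, since scikit-learn provides no online random forest; data and batch sizes are made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Data arrive in sequential mini-batches; the predictor is updated at each
# step instead of being refit on the full training set (batch learning).
for step in range(20):
    X_batch = rng.normal(size=(50, 3))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 3))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(model.score(X_test, y_test))
```

Genuinely online tree ensembles (e.g. Mondrian forests) update the tree structure itself as new cases arrive; the sketch above only conveys the sequential-update protocol.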
One of the strengths of RF is that they can handle missing data.
RF can naturally be adapted to the unbalanced data setting by down-sampling the majority class and growing each tree on a more balanced dataset (Chen et al. 2004; Kuhn and Johnson, 2013).
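A simplified sketch of this idea (down-sampling once before fitting, whereas the balanced RF of Chen et al. re-samples within each tree; all data and sizes here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Imbalanced toy data: 500 majority (class 0) vs 50 minority (class 1) cases.
X = rng.normal(size=(550, 4))
y = np.array([0] * 500 + [1] * 50)
X[y == 1] += 1.0  # shift the minority class so it is learnable

# Down-sample the majority class to the minority class size before fitting,
# so the forest is grown on a balanced data set.
maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]
keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[keep], y[keep])

print(np.mean(rf.predict(X) == y))
```

Without the down-sampling step, a forest trained on such data tends to predict the majority class almost everywhere, masking the minority class entirely.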
Fink et al. (2010) use RF in which each tree is trained on, and allowed to predict for, a particular region in space and time.

One-class classification (unary classification) tries to identify objects of a specific class amongst all objects, by learning from a training set containing only the objects of that class.
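A forest-based relative of this idea is the isolation forest; the sketch below trains scikit-learn's IsolationForest only on objects of the target class (the Gaussian clusters are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Training set contains only objects of the target class (clustered near 0).
X_train = rng.normal(loc=0.0, scale=0.5, size=(300, 2))

clf = IsolationForest(random_state=0).fit(X_train)

# predict() returns +1 for points judged to belong to the class, -1 otherwise.
inliers = clf.predict(rng.normal(loc=0.0, scale=0.5, size=(100, 2)))
outliers = clf.predict(rng.normal(loc=5.0, scale=0.5, size=(100, 2)))
print(inliers.mean(), outliers.mean())
```

Points far from the training cluster are isolated after very few random splits, so the forest assigns them anomalous scores.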
| Tree-structured Models | Random Forests |
|---|---|
| tree::tree | |
| rpart::rpart | randomForest::randomForest |
| mvpart::mvpart | randomForestSRC |
| party::mob | Rborist |
| psychotree::psychotree | ranger::ranger |
| betareg::betatree | party::cforest |
| RWeka::M5P | quantregForest |
| evtree::evtree | LogicForest |
| REEMtree::REEMtree | |
| vcrpart::vcrpart | |
| melt::melt | |
For more details, see the CRAN Task View on Machine Learning & Statistical Learning.
- RF can be viewed as weighted neighborhoods schemes.
+ \(\hat{y} = \sum_{i=1}^{n} W(x_i, x') y_i\).
+ where \(W(x_i, x')\) is the non-negative weight of the \(i\)th training point relative to the new point \(x'\).
+ In \(k\)-NN, the weight \(W(x_i, x') = \frac{1}{k}\) if \(x_i\) is one of the \(k\) points closest to \(x'\), and zero otherwise.
+ In RF, \(W(x_i, x') = \frac{1}{k'}\) if \(x_i\) is one of the \(k'\) points in the same leaf as \(x'\), and zero otherwise.
+ Since a forest averages the predictions of a set of \(m\) trees with individual weight functions \(W_j\), its prediction is \(\hat{y} = \sum_{i=1}^{n} \left( \frac{1}{m}\sum_{j=1}^{m} W_j(x_i, x') \right) y_i\).
This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of \(x'\) in this interpretation are the points \(x_i\) that fall in the same leaf as \(x'\) in at least one tree of the forest. In this way, the neighborhood of \(x'\) depends in a complex way on the structure of the trees, and thus on the structure of the training set.
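This identity can be verified numerically with scikit-learn: when trees are grown without bootstrap resampling, each leaf prediction is exactly the mean of the training responses in that leaf, so the averaged co-leaf weights reproduce the forest prediction (a sketch, not taken from any of the cited papers):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
x_new = X[:1]

# bootstrap=False so every tree sees the full training set.
rf = RandomForestRegressor(n_estimators=25, bootstrap=False,
                           random_state=0).fit(X, y)

# apply() gives, for each tree, the leaf index each point falls into.
train_leaves = rf.apply(X)        # shape (n, m)
new_leaves = rf.apply(x_new)[0]   # shape (m,)

# W_j(x_i, x') = 1/k'_j if x_i shares tree j's leaf with x', else 0;
# the forest weight averages over the m trees.
same_leaf = (train_leaves == new_leaves)          # (n, m) indicator
W = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)

print(np.dot(W, y), rf.predict(x_new)[0])  # the two agree
```

With bootstrap resampling the equality becomes approximate, because each tree's leaf means are taken over its in-bag cases only.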
For more details, see Lin and Jeon (2006).
See the talk slide